W4: Data Wrangling with Tidy Data, Part 1

Where are we?

Illustration by Allison Horst

How’s it going?Illustration by Allison Horst

Data Science Workflow

Taken from R for Data Science

We start with Transform and Visualize with the assumption that our data is in a nice, “tidy” state.

Our working Tidy Data: DepMap Project

https://depmap.org/

We will work with metadata, mutation, and expression data.frames.

What do you want to do with this data.frame?

Remember that a major theme of the course is about: How we organize ideas <-> Instructing a computer to do something.

With Tidy data, we can ponder how we want to transform our data that satisfies our scientific question.

dplyr lets us do data wrangling

Illustration by Allison Horst

When do I use library()?

You should only have to load packages once in your session. So using library(tidyverse) will load most of everything you need.

Six main dplyr functions

Function Name Purpose When
select() Selects sets of columns in df This week
filter() Filters rows in df This week
mutate() Calculate a New Column in df Next Week
group_by()/summarize() Calculate summary statistics across groups Next Week
arrange() Sorts a df by one or more columns Next Week

And Some More!

Function Name Purpose When
_join() Functions to merge two tables together Next Week
|> Operation to build pipelines This Week

Subsetting a dataframe

In the dataframe you have here, which rows would you filter for and columns would you select that relate to a scientific question?

✅ Implicit: “I want to filter for rows such that the subtype is breast cancer and look at the Age and Sex.”

🚫 Explicit: “I want to filter for rows 20-50 and select columns 2 and 8”.

Notice that when we filter for rows in an implicit way, we often formulate criteria about the columns.

How we do it:

library(tidyverse)

metadata_filtered = filter(metadata, OncotreeLineage == "Breast")
breast_metadata = select(metadata_filtered, ModelID, Age, Sex)

head(breast_metadata)

Here, filter() and select() are functions from the tidyverse package.

filter()

metadata_filtered = filter(metadata, OncotreeLineage == "Breast"):

The second argument: a logical indexing vector built from a comparison operator?

But the variable OncotreeLineage does not exist in our environment!

Rather, OncotreeLineage is a column from metadata, and we are referring to it as a data variable. We can directly refer to the column vector metadata$OncotreeLineage with just OncotreeLineage.

Try filter Out

Try filter() for Sex == "Female":

select()

The input arguments for select() are also data variables.

select(metadata_filtered,    # Our dataset 
       ModelID, Age, Sex)   # Our columns

Try select() out

Add OncotreeLineage to the select() statement:

Keep In Mind

  • select() works on columns
  • filter() works on rows

Combining Operations into a Pipeline

The Common Thing about tidyverse functions

Both filter() and select():

  • Take a data.frame as input
  • Return a data.frame as output

Why Pipes?

When combining multiple functions in one expression, it gets harder to read:

breast_metadata = select(filter(metadata, OncotreeLineage == "Breast"), ModelID, Age, Sex)

Or, this: 🤨

result2 = function1(function2(function3(dataframe)))

Or… 🤕

result = function1(function2(function3(dataframe, df_col4, df_col2), arg2), df_col5, arg1)

Pipes to make nested functions readable

result2 = dataframe |> 
    function1 |> 
    function2 |> 
    function3
result = function1(df_col5, arg1) |>
         function2(arg2) |>
         function3(df_col4, df_col2)

Applying our knowledge

Rewrite the select() and filter() function composition example using the pipe metaphor and syntax.

breast_metadata = metadata |> 
  filter(OncotreeLineage == "Breast") |>
  select(ModelID, Age, Sex)

breast_metadata

Reading Code with Pipes

When I see pipes, I read them as AND THEN:

metadata |>                                   ## I took the metadata data.frame AND THEN
   filter(OncotreeLineage == "Breast") |>     ## I filtered it AND THEN
   select(ModelID, Age, Sex)                  ## I selected columns from it

Why does this work?

graph TD
  A["metadata<br/>(data.frame)"] 

  • Pipes work by assuming the first argument is the dataset
  • Our first input is a data.frame
metadata 

Why does this work?

graph TD
  A["metadata<br/>(data.frame)"] --"filter(OncotreeLineage == 'Lung')"--> B["filtered_metadata<br/>(data frame)"]

  • Pipes work by assuming the first argument is the dataset
  • We input our data.frame into the first function:
metadata |>
   filter(OncotreeLineage == "Breast")

The output at this point is a data.frame.

Why does this work?

graph TD
  A["metadata<br/>(data.frame)"] --"filter(OncotreeLineage == 'Lung')"--> B["filtered_metadata<br/>(data frame)"] 
  B --"select(ModelID, Age, Sex)"--> C["selected_metadata<br/>(data frame)"]

  • Pipes work by assuming the first argument is the dataset
  • We input our data.frame into the first function:
metadata |>
   filter(OncotreeLineage == "Breast")
  • The output at this point is a data.frame, which means we can feed it into our next function:
metadata |>
   filter(OncotreeLineage == "Breast") |>
   select(ModelID, Age, Sex)
  • The output at this point is also a data.frame.

Tip for building pipelines

Look at the output at each step using head() before you move on!

metadata |> 
  filter(OncotreeLineage == "Breast") |>
  head()

Step 2

metadata |> 
  filter(OncotreeLineage == "Breast") |>
  select(ModelID, Age, Sex) |>
  head()

🤠

Try it Out

Build a pipeline that

  • filter(OncotreeLineage == "Lung")
  • select(ModelID, OncotreeLineage, Age)

Try piping the output into head() as you build it up

What’s Next

  • Making new columns in data with mutate()
  • Make summaries with group_by()/summarize()
  • Merging datasets with _join() functions